
Core: Add InternalData read and write builders #12060

Merged: 12 commits merged into apache:main on Feb 13, 2025

Conversation

@rdblue (Contributor) commented Jan 23, 2025

This adds InternalData with read and write builder interfaces that can be used with Avro and Parquet by passing a FileFormat. Formats are registered by calling InternalData.register with callbacks to create format-specific builders.

The class is InternalData because registered builders are expected to use the internal object model that is used for Iceberg metadata files. Using a specific object model avoids needing to register callbacks to create value readers and writers that produce the format needed by the caller.

To demonstrate the new interfaces, this PR implements them using both Avro and Parquet. Parquet can't be used yet, because it would fail at runtime until #11904 is committed (it is also missing custom type support).

Avro is working. To demonstrate that the builders can be used for metadata, this updates ManifestWriter, ManifestReader, and ManifestListWriter to use InternalData builders. It was also necessary to migrate the metadata classes to implement StructLike for the internal writers instead of IndexedRecord.
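For orientation, here is a rough sketch of how the registered builders are meant to be used. The entry points and builder methods shown (write, read, schema, project, build) are assumptions based on the description above, not a quote of the merged API:

import org.apache.iceberg.FileFormat;
import org.apache.iceberg.InternalData;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

class InternalDataSketch {
  // Write metadata rows using whichever builder was registered for the format.
  static FileAppender<StructLike> newAppender(OutputFile file, Schema schema) {
    return InternalData.write(FileFormat.AVRO, file) // dispatches to the registered callback
        .schema(schema)
        .build();
  }

  // Read the rows back as StructLike, the internal object model.
  static CloseableIterable<StructLike> readRows(InputFile file, Schema schema) {
    return InternalData.read(FileFormat.AVRO, file)
        .project(schema)
        .build();
  }
}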

DynMethods.StaticMethod registerParquet =
    DynMethods.builder("register")
        .impl("org.apache.iceberg.parquet.Parquet")
        .buildStaticChecked();
Contributor Author:

This uses DynMethods to call Parquet's register method directly, rather than using a ServiceLoader. There is no need for that complexity because we want to keep the number of supported formats small rather than allow plugging in custom formats.

I'm also considering refactoring to make the register method here package-private so that no one can easily call it.
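Assembled from the two fragments quoted in this thread (the builder above and the invoke/catch further down), the full registration path plausibly looks like the following; the enclosing static block is an assumption:

  static {
    try {
      // Look up Parquet.register() reflectively so core does not need a
      // compile-time dependency on the iceberg-parquet module.
      DynMethods.StaticMethod registerParquet =
          DynMethods.builder("register")
              .impl("org.apache.iceberg.parquet.Parquet")
              .buildStaticChecked();

      registerParquet.invoke();

    } catch (NoSuchMethodException e) {
      // failing to load Parquet is normal and does not require a stack trace
    }
  }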

Member:

I do not understand this one: why can we call Avro's register() directly but not Parquet.register()? I'm also not clear on the ServiceLoader comment; is that just to note that we don't want to make this dynamic and only want hardcoded formats to be supported?

Contributor:

This is due to Gradle project-level isolation. Avro is currently included in core, but Parquet is in a separate subproject. I'm in favor of being explicit about what is supported (i.e. hard-coded), but we would like to keep Parquet in a separate project to reduce dependency proliferation from api/core.

Contributor:

What about using the Java ServiceLoader to load the internal readers and writers?

I have created a WIP for testing out how the DataFile readers could work: #12069

Contributor:

First I implemented it using the registry method like the one in this PR (7e171cc), then moved to the ServiceLoader method.

In my head the two problems are very similar.

Contributor Author:

The ServiceLoader framework is error-prone and commonly broken by jar bundling. In addition, we do not want anything else registered, so it is not needed; it would make it easier to plug in here, which is exactly what we are trying to avoid.

Member:

> I do not understand this one: why can we call Avro's register() directly but not Parquet.register()? I'm also not clear on the ServiceLoader comment; is that just to note that we don't want to make this dynamic and only want hardcoded formats to be supported?

Agree with @RussellSpitzer here. Maybe it would be better to add a comment about it.

-  public org.apache.avro.Schema getSchema() {
-    return AVRO_SCHEMA;
+  public int size() {
+    return MANIFEST_LIST_SCHEMA.columns().size();
Contributor Author:

Avro schemas are no longer needed when using StructLike rather than IndexedRecord.
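For context on why the schemas drop out, compare what the two interfaces demand of a metadata class (both shown as they exist upstream; IndexedRecord inherits getSchema from GenericContainer):

// Avro's IndexedRecord ties every record instance to an Avro schema:
public interface IndexedRecord extends GenericContainer {
  // GenericContainer requires: org.apache.avro.Schema getSchema();
  void put(int i, Object v);
  Object get(int i);
}

// Iceberg's StructLike is positional and carries no schema:
public interface StructLike {
  int size();
  <T> T get(int pos, Class<T> javaClass);
  <T> void set(int pos, T value);
}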

  private DataFile wrapped = null;

  IndexedDataFile(org.apache.avro.Schema avroSchema) {
    this.avroSchema = avroSchema;
    this.partitionWrapper = new IndexedStructLike(avroSchema.getField("partition").schema());
Contributor Author:

There is also no need for a wrapper to adapt PartitionData to IndexedRecord because it is already StructLike.

Member:

Big fan of this change

@@ -90,14 +103,18 @@ private enum Codec {
  }

  public static WriteBuilder write(OutputFile file) {
    if (file instanceof EncryptedOutputFile) {
Contributor Author:

Encryption is handled by adding this. The read side already has a similar check.
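A sketch of the likely shape of that check; the delegation to an encryption-aware overload is an assumption, not a quote of the PR:

public static WriteBuilder write(OutputFile file) {
  if (file instanceof EncryptedOutputFile) {
    // route encrypted files to the encryption-aware overload instead of
    // treating them as plain output files
    return write((EncryptedOutputFile) file);
  }

  return new WriteBuilder(file);
}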

@@ -76,6 +76,15 @@ public void setSchema(Schema schema) {
    initReader();
  }

  @Override
  public void setCustomTypes(
Contributor Author:

Because the InternalReader is no longer created by ManifestReader, the custom types now need to be passed to the read builder and forwarded to the reader here. Custom type support will need to be implemented for Parquet as well.
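A hypothetical sketch of that forwarding, assuming the builder collects field-ID-to-class mappings and hands them to the reader it constructs (all names here are illustrative, not the PR's code):

// On the read builder: collect the mappings.
private final Map<Integer, Class<? extends StructLike>> customTypes = Maps.newHashMap();

public ReadBuilder setCustomType(int fieldId, Class<? extends StructLike> structClass) {
  customTypes.put(fieldId, structClass); // remembered until build time
  return this;
}

// When the reader is constructed: forward the mappings to it.
private void configure(InternalReader<?> reader) {
  reader.setCustomTypes(customTypes);
}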

@@ -20,7 +20,7 @@

 import java.util.Map;

-/** An interface for Avro DatumReaders to support custom record classes. */
+/** An interface for Avro DatumReaders to support custom record classes by name. */
Contributor Author:

This is used to distinguish between Iceberg's custom types, which use field IDs and StructLike, and Avro's, which worked by renaming Avro records to class names that would be dynamically loaded.

  }

  private static WriteBuilder writeInternal(OutputFile outputFile) {
    return write(outputFile);
Contributor Author:

This will be where the internal object model is injected for Parquet.

@@ -1171,6 +1188,16 @@ public ReadBuilder withNameMapping(NameMapping newNameMapping) {
    return this;
  }

  @Override
  public ReadBuilder setRootType(Class<? extends StructLike> rootClass) {
    throw new UnsupportedOperationException("Custom types are not yet supported");
Contributor Author:

When the internal object model is complete, this should be implemented to instantiate the expected metadata classes while reading.

      registerParquet.invoke();

    } catch (NoSuchMethodException e) {
      // failing to load Parquet is normal and does not require a stack trace
Member:

Is this normal for now? Don't we expect this to be a bug in the future? I'm also a little interested in when we would actually fail here if we are using the Iceberg repo as-is.

Contributor:

I wouldn't say that it's normal to fail. I'm actually not aware of any situations where the api/core modules are used but parquet isn't included. I think in almost all scenarios, it'll be available.

Contributor Author:

This would be normal whenever the iceberg-parquet module isn't in the classpath. For instance, the manifest read and write tests that are currently using InternalData in this PR hit this but operate normally because Parquet isn't used.

Member:

This still feels a bit odd. If it can only plausibly occur during tests, I'm not sure we should even log this. We would still get errors at runtime if you attempt to get a reader or writer when Parquet isn't loaded.

      return writeBuilder.apply(file);
    }

    throw new UnsupportedOperationException(
Member:

nit: Personally I think it may be a bit clearer to extract the handling of a missing writer/reader.

Maybe have

private static Function<OutputFile, WriteBuilder> writerFor(FileFormat format) {
  Function<OutputFile, WriteBuilder> writer = WRITE_BUILDERS.get(format);
  if (writer == null) {
    throw new UnsupportedOperationException("Cannot write format: " + format);
  } else {
    return writer;
  }
}

so that this code is just

return writerFor(format).apply(file);

Mostly I feel a little uneasy about the implicit else in the current logic, so having an explicit else would also make me feel a little better.

  WriteBuilder meta(String property, String value);

  /**
   * Set a file metadata properties from a Map.
Member:

Suggested change:
- * Set a file metadata properties from a Map.
+ * Set file metadata properties from a Map.

  /**
   * Set a writer configuration property.
   *
   * <p>Write configuration affects writer behavior. To add file metadata properties, use {@link
Member:

Suggested change:
- * <p>Write configuration affects writer behavior. To add file metadata properties, use {@link
+ * <p>Write configuration affects this writer's behavior. To add metadata properties to the written file use {@link

?

Contributor Author:

"This" doesn't refer to a writer. This is configuring a builder that creates the writer, so I think that the existing language is correct.

Member:

I guess I just find these two APIs a little confusing. One is properties for the writer we are building, and one is metadata for files created by that writer. It's probably clear enough, though, since I think I have it straight.

Contributor:

This is aligned with the existing writer APIs. We have had this for quite some time already, so I think we should stick to it unless we have strong reasons to change. The change in the javadoc helps too.

  ReadBuilder reuseContainers();

  /** Set a custom class for in-memory objects at the schema root. */
  ReadBuilder setRootType(Class<? extends StructLike> rootClass);
Member:

I'll probably get to this later in the PR, but I'm interested in why we need both this and setCustomType.

Member:

OK, I see how it's used below. I'm wondering if, instead of needing this, we could just automatically set these readers based on the root type, i.e.

setRootType(ManifestEntry) // automatically sets field types based on ManifestEntry

Or do we have a plan for using this in a more custom manner in the future?

Contributor Author:

The problem this is solving is that we don't have an assigned ID for the root type. We could use a sentinel value like -1, but that could technically collide. I just don't want to rely on setCustomType(ROOT_FIELD_ID, SomeObject.class).
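To make the split concrete, a usage sketch (the field-ID constant and the surrounding calls are illustrative, not the PR's code):

CloseableIterable<StructLike> entries =
    InternalData.read(FileFormat.AVRO, inputFile)
        .project(entrySchema)
        .setRootType(GenericManifestEntry.class)                   // the root has no field ID
        .setCustomType(DATA_FILE_FIELD_ID, GenericDataFile.class)  // nested structs keyed by ID
        .build();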

@RussellSpitzer (Member) commented:

From a release perspective, should we merge this post-1.8? Just thinking we probably want it in the build for a bit before we ship it. I know that partition stats is downstream of this, so that is a dependency to consider, but I'm not sure we can get that all together rapidly if we want to do this in the next week or so.

@rdblue (Contributor Author) commented Jan 23, 2025

> From a release perspective, should we merge this post-1.8? Just thinking we probably want it in the build for a bit before we ship it. I know that partition stats is downstream of this, so that is a dependency to consider, but I'm not sure we can get that all together rapidly if we want to do this in the next week or so.

I agree. There's no need to target this for 1.8, especially when it isn't clear that the Parquet internal object model will make it. I just wanted to get this out for discussion since we are currently blocked on creating Parquet metadata files until we either merge Parquet into core or implement something like this.

@rdblue force-pushed the internal-add-internal-data-builders branch from ef05f9c to 01a9848 (January 27, 2025 22:39)
@rdblue force-pushed the internal-add-internal-data-builders branch from 01a9848 to acb5778 (January 27, 2025 22:55)
@rdblue force-pushed the internal-add-internal-data-builders branch from c651737 to 952e89b (February 11, 2025 21:47)
   * @param value config value
   * @return this for method chaining
   */
  WriteBuilder set(String property, String value);
Contributor:

Is it intentional that we have meta(Map), but we don't have set(Map)?

Contributor Author:

meta(Map) is used in the code paths that call this. We could also plumb through setAll, but I was going for a minimal set of methods here.

-    this.avroSchema = AvroSchemaUtil.convert(entrySchema(partitionType), "manifest_entry");
-    this.fileWrapper = new IndexedDataFile(avroSchema.getField("data_file").schema());
+  ManifestEntryWrapper() {
+    this.size = entrySchema(Types.StructType.of()).columns().size();
Member:

nit: We have an EmptyStructLike we could use here. Although maybe we should just have "of()" always return a singleton.

Member:

Thinking more, maybe this should just be a constant? It feels like we are doing a lot of work to get the number of elements in the data file schema.

Contributor Author:

This is a Type, so I don't think EmptyStructLike would work. We can definitely have a special case for an empty schema though.

-  IndexedManifestEntry(Long commitSnapshotId, Types.StructType partitionType) {
-    this.avroSchema = AvroSchemaUtil.convert(entrySchema(partitionType), "manifest_entry");
+  ManifestEntryWrapper(Long commitSnapshotId) {
+    this.size = entrySchema(Types.StructType.of()).columns().size();
Member:

Same comment as above; it just seems like we are doing a lot of work to get a constant.

Contributor Author:

I know what you mean, but I didn't want this to get out of sync with entrySchema. Still calling that method seemed like the safest path.

@rdblue (Contributor Author) commented Feb 13, 2025

Merging this now that the 1.8.0 vote has passed. Thanks for reviewing, everyone!

@rdblue merged commit b8fdd84 into apache:main on Feb 13, 2025 (46 checks passed).